Enhancing Document Clustering Using Hybrid Models for Semantic Similarity
Authors
Abstract
Various document representation models have been proposed to measure semantic similarity between documents using corpus statistics. Some of these models explicitly estimate semantic similarity based on measures of correlation between terms, while others apply dimension reduction techniques to obtain latent representations of concepts. This paper proposes new hybrid models that combine explicit and latent analysis to estimate semantic similarity between documents. The proposed models have been used to enhance the performance of document clustering algorithms. Experiments on thirteen benchmark data sets show that the hybrid models achieve a significant improvement in clustering performance when used with clustering algorithms that are sensitive to errors in estimating document similarity.
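The abstract does not say how the hybrid models combine the two kinds of analysis, so the following is only an illustrative sketch and not the paper's method. It assumes the simplest possible combination: a weighted average of an explicit similarity (documents mapped through a term-term cosine correlation matrix) and a latent similarity (documents projected onto the top singular vectors, as in latent semantic analysis). The function names, the toy matrix `docs`, the rank `k`, and the weight `alpha` are all hypothetical.

```python
import numpy as np

def cosine_sim(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for empty rows
    Xn = X / norms
    return Xn @ Xn.T

def explicit_similarity(X):
    """Explicit analysis (assumed form): use term-term correlations directly."""
    term_corr = cosine_sim(X.T)      # terms x terms correlation estimate
    mapped = X @ term_corr           # spread each document over correlated terms
    return cosine_sim(mapped)

def latent_similarity(X, k):
    """Latent analysis (assumed form): rank-k SVD projection, as in LSA."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    Z = U[:, :k] * s[:k]             # document coordinates in the latent space
    return cosine_sim(Z)

def hybrid_similarity(X, k=2, alpha=0.5):
    """Hypothetical hybrid: weighted average of explicit and latent similarity."""
    return alpha * explicit_similarity(X) + (1.0 - alpha) * latent_similarity(X, k)

# Toy document-term matrix: two documents per topic, disjoint vocabularies.
docs = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])
S = hybrid_similarity(docs, k=2, alpha=0.5)
raw = cosine_sim(docs)
```

In this toy example, documents that share a topic end up more similar under `S` than under plain cosine similarity on the raw term counts (`raw`), while documents from different topics remain dissimilar. The resulting matrix could be passed to any clustering algorithm that accepts a precomputed similarity matrix.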
Similar resources
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing the semantic concepts of text has motivated researchers to use...
Domain Based Punjabi Text Document Clustering
Text clustering is a text mining technique used to group similar documents into a single cluster by using a similarity measure and separating the dissimilar documents. Popular clustering algorithms for text clustering treat a document as a conglomeration of words. The syntactic or semantic relations between words are not given any consideration. Many different algorithms ...
Hybrid Document Indexing with Spectral Embedding
Document representation has a large impact on the performance of document retrieval and clustering algorithms. We propose a hybrid document indexing scheme that combines the traditional bag-of-words representation with spectral embedding. This method accounts for the specifics of the document collection and also uses semantic similarity information based on a large-scale statistical analysis. Cl...
An Efficient Document Clustering Based on HUBNESS Proportional K-Means Algorithm
Evaluating the similarity between documents is a core operation in the text processing field. Similarity measurement is used to estimate the relationship between records or documents. In the existing system, the similarity between two documents can be computed with respect to a feature by using the Similarity Measure for Text Processing (SMTP). In the proposed hybrid, the SMTP scheme is integrated with hubness base...
Hybrid Approach for Punjabi Text Clustering
Text clustering is a text mining technique used to group similar documents into a single cluster by using a similarity measure and placing dissimilar documents into different clusters. Most of the popular clustering algorithms treat a document as a conglomeration of words and do not consider the syntactic or semantic relations between words. To overcome this drawback, some algori...